Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(web-analytics): Add backend version of web analytics queries #17629

Merged
merged 13 commits into from
Oct 2, 2023

Conversation

robbie-c
Copy link
Member

@robbie-c robbie-c commented Sep 26, 2023

Problem

Follow on from #17626

Changes

Move the Top Sources query to the back end

Screenshot 2023-09-26 at 17 34 44

How did you test this code?

(Behind FF) ran it manually

@posthog-bot
Copy link
Contributor

Hey @robbie-c! 👋
This pull request seems to contain no description. Please add useful context, rationale, and/or any other information that will help make sense of this change now and in the distant Mars-based future.

@robbie-c robbie-c changed the base branch from master to add-web-analytics-dashboard-alpha September 26, 2023 15:03
@robbie-c robbie-c force-pushed the add-top-sources-backend-call branch from 106d7b3 to 2ce110e Compare September 26, 2023 15:04
@robbie-c robbie-c changed the title feat(web-analytics): Add top sources backend call feat(web-analytics): Add backend version of top source SQL Sep 26, 2023
Base automatically changed from add-web-analytics-dashboard-alpha to master September 26, 2023 15:11
@robbie-c robbie-c force-pushed the add-top-sources-backend-call branch from 850b612 to 297be07 Compare September 26, 2023 16:31
@@ -5,7 +5,7 @@ import { SSO_PROVIDER_NAMES } from 'lib/constants'
import { preflightLogic } from 'scenes/PreflightCheck/preflightLogic'
import { SSOProvider } from '~/types'

interface SSOSelectInterface {
export interface SSOSelectInterface {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive-by fix, fixing a TypeScript error in some test code (which therefore doesn't trigger CI failures, but does still show up in my editor)

@robbie-c robbie-c marked this pull request as ready for review September 27, 2023 07:11
@robbie-c robbie-c force-pushed the add-top-sources-backend-call branch from 3436a0b to 789c19d Compare September 27, 2023 14:07
@robbie-c robbie-c changed the title feat(web-analytics): Add backend version of top source SQL feat(web-analytics): Add backend version of web analytics queries Sep 27, 2023
Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! I left some loose thoughts below regarding code placement, etc.

frontend/src/queries/schema.ts Show resolved Hide resolved
posthog/hogql_queries/query_runner.py Outdated Show resolved Hide resolved
posthog/hogql_queries/query_runner.py Outdated Show resolved Hide resolved
posthog/hogql_queries/query_runner.py Outdated Show resolved Hide resolved
posthog/hogql_queries/query_runner.py Outdated Show resolved Hide resolved
Comment on lines +133 to 147
const results = !response
? null
: 'results' in response && Array.isArray(response.results)
? response.results
: 'result' in response && Array.isArray(response.result)
? response.result
: null
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to find a way to standardise here 🙈

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, hogql and hogql insights were already different here 🙉 I could just change one but I'm a bit scared of sweeping changes like that. Probably safest to change the hogql insights given it's not running in prod yet, and could do that in a different PR to keep this one less noisy.

@robbie-c robbie-c force-pushed the add-top-sources-backend-call branch from 41ee532 to 860ebc9 Compare September 28, 2023 16:26
Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more thoughts

posthog/types.py Outdated Show resolved Hide resolved
Comment on lines +19 to +105
"""
WITH

session_cte AS (
SELECT
events.properties.`$session_id` AS session_id,
min(events.timestamp) AS min_timestamp,
max(events.timestamp) AS max_timestamp,
dateDiff('second', min_timestamp, max_timestamp) AS duration_s,

-- create a tuple so that these are grouped in the same order, see https://github.com/ClickHouse/ClickHouse/discussions/42338
groupArray((events.timestamp, events.properties.`$referrer`, events.properties.`$pathname`, events.properties.utm_source)) AS tuple_array,
arrayFirstIndex(x -> tupleElement(x, 1) == min_timestamp, tuple_array) as index_of_earliest,
arrayFirstIndex(x -> tupleElement(x, 1) == max_timestamp, tuple_array) as index_of_latest,
tupleElement(arrayElement(
tuple_array,
index_of_earliest
), 2) AS earliest_referrer,
tupleElement(arrayElement(
tuple_array,
index_of_earliest
), 3) AS earliest_pathname,
tupleElement(arrayElement(
tuple_array,
index_of_earliest
), 4) AS earliest_utm_source,

if(domain(earliest_referrer) = '', earliest_referrer, domain(earliest_referrer)) AS referrer_domain,
multiIf(
earliest_utm_source IS NOT NULL, earliest_utm_source,
-- This will need to be an approach that scales better
referrer_domain == 'app.posthog.com', 'posthog',
referrer_domain == 'eu.posthog.com', 'posthog',
referrer_domain == 'posthog.com', 'posthog',
referrer_domain == 'www.google.com', 'google',
referrer_domain == 'www.google.co.uk', 'google',
referrer_domain == 'www.google.com.hk', 'google',
referrer_domain == 'www.google.de', 'google',
referrer_domain == 't.co', 'twitter',
referrer_domain == 'github.com', 'github',
referrer_domain == 'duckduckgo.com', 'duckduckgo',
referrer_domain == 'www.bing.com', 'bing',
referrer_domain == 'bing.com', 'bing',
referrer_domain == 'yandex.ru', 'yandex',
referrer_domain == 'quora.com', 'quora',
referrer_domain == 'www.quora.com', 'quora',
referrer_domain == 'linkedin.com', 'linkedin',
referrer_domain == 'www.linkedin.com', 'linkedin',
startsWith(referrer_domain, 'http://localhost:'), 'localhost',
referrer_domain
) AS blended_source,

countIf(events.event == '$pageview') AS num_pageviews,
countIf(events.event == '$autocapture') AS num_autocaptures,
-- in v1 we'd also want to count whether there were any conversion events

any(events.person_id) as person_id,
-- definition of a GA4 bounce from here https://support.google.com/analytics/answer/12195621?hl=en
(num_autocaptures == 0 AND num_pageviews <= 1 AND duration_s < 10) AS is_bounce
FROM
events
WHERE
session_id IS NOT NULL
AND
events.timestamp >= now() - INTERVAL 8 DAY
GROUP BY
events.properties.`$session_id`
HAVING
min_timestamp >= now() - INTERVAL 7 DAY
)



SELECT
blended_source,
count(num_pageviews) as total_pageviews,
count(DISTINCT person_id) as unique_visitors,
avg(is_bounce) AS bounce_rate
FROM
session_cte
WHERE
blended_source IS NOT NULL
GROUP BY blended_source

ORDER BY total_pageviews DESC
LIMIT 100
""",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These long queries look like something that takes a long time for HogQL to parse 🙈. 11+sec for me locally. Hopefully the C++ parser will solve this.

Comment on lines +7 to +12
class WebAnalyticsQueryRunner(QueryRunner, ABC):
def _is_stale(self, cached_result_package):
return True

def _refresh_frequency(self):
return BASE_MINIMUM_INSIGHT_REFRESH_INTERVAL
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of scope / random thought: if we're going to add many queries, these should also get defaults in the base runner, and be overidden on a as-needed basis.

frontend/src/queries/nodes/DataTable/DataTable.tsx Outdated Show resolved Hide resolved
posthog/api/query.py Outdated Show resolved Hide resolved
posthog/hogql_queries/web_analytics/top_clicks.py Outdated Show resolved Hide resolved
posthog/hogql_queries/query_runner.py Outdated Show resolved Hide resolved
@@ -6,38 +6,38 @@ import { NodeKind } from '~/queries/schema'
export function WebAnalyticsScene(): JSX.Element {
return (
<div>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest adding a <div> tag around all the titles, and replacing this line with

Suggested change
<div>
<div className='space-y-2'>

or so.

Check out the https://storybook.posthog.net/ for more style guidelines.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm intentionally not including any styling changes in this PR - that'll be a separate PR. The text here is just for my benefit as I develop it.

from posthog.schema import WebTopPagesQuery, WebTopPagesQueryResponse


class TopPagesQueryRunner(WebAnalyticsQueryRunner):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something with this query is not working for me. I'm getting a Illegal type String of argument for aggregate function avg. error. A quick check locally under data management shows that I don't have the $prev_pageview_max_scroll_percentage property at all, so it's likely treated as a string, not an integer. I wouldn't consider this blocking for an alpha.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have the same problem, I'll need to figure this out before we roll it out to new people. I naively added a toFloat around this, which didn't work either, but I'm sure there must be something I can do.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly a very naive toString -> toFloat might work?
The alternative is to make sure that the required property definitions exist with the right types, when loading the page.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The answer was JSONExtractRaw -> toFloat, will be in a separate PR

@robbie-c robbie-c force-pushed the add-top-sources-backend-call branch from 80d0191 to 9fa716a Compare September 29, 2023 13:56
@robbie-c robbie-c force-pushed the add-top-sources-backend-call branch from 9fa716a to 9698c04 Compare September 29, 2023 14:47
@robbie-c robbie-c requested a review from mariusandra October 2, 2023 07:58
Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last nits, non blocking.

class TestSubclassQueryRunner(TestQueryRunner): # type: ignore
pass

team = Team.objects.create(pk=42, organization=self.organization)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know when and where we're running these tests. pk=42 might already exist. Why not leave it empty and get an autoassigned one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment. The first one of these tests already existed, I'd just copied it and adapted it for a slightly different case.

Copy link
Collaborator

@mariusandra mariusandra Oct 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair. I guess the ID is hardcoded to have a stable cache key. So all ok 🤷

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess my last nit here is that if every other file in the hogql_queries folder follows the "filename is an underscore version of the file's main class's name", then what excuse do these three files have to be different? "Not enough disk space" is not a right answer. "It was faster to write it if it's shorter" is also the wrong answer.

A good book I recommend to people with regards to website usability is called "Don't make me think". The main premise is that if you force your users to stop and think, breaking them out of an intuitive flow, your UX is bad. Same here: I need to think why these are different.

This is not blocking, but just causes me to stop and 🤔. Best case we ignore this, worse case, someone 6 months from now, will also take time to ponder the inconsistency, and then spend time making a PR to rename them, and will take someone else's time to review the PR.

Copy link
Member Author

@robbie-c robbie-c Oct 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning was

  • Can parse it visually very easily. Something I didn't like before was that query_runner, lifecycle_query_runner, insight_query_runner, etc, all looked very similar and it wasn't immediate which file was the special one that contained the base class, vs all of the files that contained a subclass
  • It's already in the web analytics folder, so the filename doesn't need to include the words web analytics
  • It's useful for the class names to keep the words web analytics, because they're used outside of this folder, and it provides context to the person reading the code that uses them.

@robbie-c robbie-c merged commit 217496f into master Oct 2, 2023
@robbie-c robbie-c deleted the add-top-sources-backend-call branch October 2, 2023 12:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants